We are tasked with predicting which day of the week our customers will visit again. This notebook details the steps taken to solve this problem. We are given a dataset with the weekly visits of 300000 customers. Unfortunately, the structure is not ready for machine learning, so a lot of data preparation needs to be done. Once the structure is fixed, we do some feature engineering. Finally, using random forests, we predict which day of the week a customer will visit.
The dataset (https://drive.google.com/open?id=1IhZUJa9r44SAhZQQOLB7wEqdVtx6Qtuq) provides information about shopping mall visits. Each line represents one customer: the first column contains a unique customer identifier and the second column contains the indices of the days on which the customer visited the mall. The day with index 1 is a Monday (so index 7 is a Sunday and index 8 is again a Monday). Indices range from 1 to 1001, which equals 143 full weeks. The task is to predict the first day of the next visit (week 144). For example, if a customer will visit the mall on Wednesday, the prediction should be 3:
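To make the indexing concrete, here is a minimal sketch (the function name is mine, not from the dataset) of how a day index maps to a weekday:

```python
def day_of_week(day_index):
    """Map a day index (1..1001) to a weekday: 1 = Monday .. 7 = Sunday."""
    return (day_index - 1) % 7 + 1

print(day_of_week(1), day_of_week(7), day_of_week(8))  # Monday, Sunday, Monday again
```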
There are many machine learning algorithms that can be used for classification problems. I decided on random forests because they are easy to implement, provide feature importances and can be top performers if tuned correctly.
We use binary classification models to predict the visit probability for each day of the week independently. Hence, the accuracy for each model varies:
Reporting these high classification accuracies can be misleading. For example, for our Day 1 predictions we get around 90% accuracy because the model predicts all values will be 0, i.e. no customers are predicted to visit on Monday. Because there are so few Monday visits (compared to other days and to no visits), this is not necessarily a bad prediction. If our task were to predict Mondays more accurately, there are other evaluation criteria we could use to force the model to predict more visits as Mondays. However, I decided it was more important to predict accurately overall and left these models as they are.
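To illustrate with made-up numbers (the 90% figure above comes from the real models; the counts below are invented for the sketch):

```python
import numpy as np

# If only ~10% of rows are Monday visits, a model that always predicts
# "no visit" already reaches ~90% accuracy without learning anything useful
y_true = np.array([0] * 90 + [1] * 10)   # hypothetical labels: 10% positives
y_pred = np.zeros_like(y_true)           # always predict "no visit"
accuracy = (y_true == y_pred).mean()
```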
To get our final prediction, we pick the day of the week with the highest probability across all 8 models. We could also structure this as a multi-label classification problem and try to correctly predict all visit days in the next week. This decision depends on how the model will be used for business applications.
As mentioned above, random forests give us feature importances. For every model, the most important feature for predicting a day of the week was the proportion of previous visits the customer made on that day. This also makes intuitive sense. For example: if a customer has visited the mall a total of 10 times, and 9 of those 10 visits were on a Monday, it is reasonable to predict that the next visit will be on a Monday.
The other very important feature is the proportion of no-visit weeks. Again, this makes sense: if we have 2 customers, A and B, where customer A visits at least 1 day a week for 10 weeks while customer B visits only once in 10 weeks, it is reasonable to predict that customer A is more likely to visit next week than customer B.
I built 2 Tableau dashboards to help explain what is happening in this model.
The following steps are taken to prepare our data for machine learning:
To solve this classification problem we are going to:
Our final models give reasonable results. However, they do not do well at predicting rare events. To address this we would need additional feature engineering. We could also try other modelling approaches, and we should try to collect more metadata, such as rainfall on specific days, public holidays, or promotions in the mall.
Given the time constraints, I didn't have enough time to try many different models or to do more feature engineering. I would really like to try deep learning on this classification problem, as it can be very good at solving problems that otherwise require a lot of feature engineering.
As a final comment: I enjoyed working on this, and I hope you enjoy going through this notebook. If you have any comments or questions, you can get in touch by email: ryanfras@gmail.com
Now let's begin :).
First we need to import the data to see what the structure looks like.
import pandas as pd
import numpy as np
LOC = 'C:\\Users\\ryanf\\OneDrive\\Data\\Customer Visits (Day of week)\\'
FILE = 'train_set.csv'
def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
        display(df)
df_raw = pd.read_csv(LOC + FILE)
df_raw.head()
The visits column stores the day indices in a space-separated format. Before we can do any machine learning we first need to fix this.
df_raw.info()
From the info statement we can see that visits is stored as an object. We can also see there is a leading space in the visits column name, so let's clean that. Since each row represents 1 customer, we know we are working with a dataset of 300000 customers.
df_raw.columns = df_raw.columns.str.strip()
Now we can focus on restructuring the visits to a more appropriate format.
visits_list = (df_raw['visits']
               .str.strip()     # remove leading and trailing whitespace
               .str.split(' ')  # split each visits string into a Python list
              )
Let's look at the first customer.
print('Number of visits: ' + str(len(visits_list[0])))
print(visits_list[0])
Although this format is better, we want to predict if and when a customer will visit. The "when" part is easier to answer if we have a time series format.
Let's solve the transformation for 1 customer and then apply it to all customers.
c1 = pd.DataFrame({'visitor_id': df_raw['visitor_id'][0],
'visit_day' : visits_list[0]})
c1.head()
This works, so we can now apply it to the other 300000 customers. I will try a loop first.
%%time
df_clean = pd.DataFrame(columns=['visitor_id', 'visit_day'])
for i in range(0, 3000):
    c = pd.DataFrame({'visitor_id': df_raw['visitor_id'][i],
                      'visit_day': visits_list[i]})
    df_clean = df_clean.append(c, ignore_index=True)
It takes around 30 seconds for 3000 customers, so 300000 customers would take around 50 minutes. I'm too impatient to wait that long, so let's try to optimize.
We will start again from where we split the visits column. This time we use expand=True to get a DataFrame.
visits_df = (df_raw['visits']
             .str.strip()                  # remove leading and trailing whitespace
             .str.split(' ', expand=True)  # split the visits into DataFrame columns
            )
len(visits_df)
Since our index has stayed the same, we can merge this data with the original data.
%%time
df_raw1 = (df_raw.merge(visits_df, right_index=True, left_index=True)  # merge the original data with the expanded visits
           .drop(['visits'], axis=1)                                   # drop the original visits column
           .melt(id_vars=['visitor_id'], value_name="visit_day")       # transform from wide format to long format
           .drop("variable", axis=1)                                   # drop the variable column added by melt
           .dropna()                                                   # drop the missing entries
          )
Much better: we transformed all 300000 customers in under 1 minute instead of waiting 50 minutes.
len(df_raw1.loc[df_raw1['visitor_id'] == 1])
df_raw1.loc[df_raw1['visitor_id'] == 1].head()
Great, we have the same entries as the original data but now in long format.
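As a sanity check, the same split/merge/melt pipeline can be reproduced on a tiny made-up frame (the visitor ids and visit days below are invented):

```python
import pandas as pd

toy = pd.DataFrame({'visitor_id': [1, 2], 'visits': [' 1 3 7', ' 2']})
expanded = toy['visits'].str.strip().str.split(' ', expand=True)
toy_long = (toy.merge(expanded, right_index=True, left_index=True)
               .drop(['visits'], axis=1)
               .melt(id_vars=['visitor_id'], value_name='visit_day')
               .drop('variable', axis=1)
               .dropna())
# visitor 1 keeps 3 rows, visitor 2 keeps 1 row; padding NaNs are dropped
```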
df_raw1.info()
visit_day is stored as an object but it contains integer values, so let's convert them.
df_raw1['visit_day'] = df_raw1['visit_day'].astype('int')
df_raw1.info()
Now that our raw data is in an appropriate format we can sort and reset index
df_raw1 = (df_raw1.sort_values(['visitor_id', 'visit_day'])
.reset_index(drop = True) # drop the old index
)
df_raw1.head()
We can store this raw data into a feather file. That way if anything goes wrong we can start again from here. Think of it as a checkpoint.
df_raw1.to_feather(LOC + 'df_raw')
Delete all the unused objects from memory. My laptop only has 16 GB of memory, so we will need to be memory efficient throughout.
del visits_df
del visits_list
del df_raw
del df_raw1
Now the data is in a more analytics-friendly format. Next, we can start doing some Exploratory Data Analysis (EDA) & feature engineering.
df = pd.read_feather(LOC + 'df_raw')
Convert the types to reduce memory usage
df['visitor_id'] = df['visitor_id'].astype(np.uint32)
df['visit_day'] = df['visit_day'].astype(np.uint32)
df.info()
%matplotlib inline
df['visit_day'].hist(bins = 1001, figsize=(18, 5))
We can see there is seasonality in this time series, likely driven by weekends. Let's explore that theory by adding the day of the week to the dataset.
def add_day_of_week(index):
    return index % 7
df.loc[df['visitor_id'] == 4].head() # Visitor 4 visited the first day so we can validate whether day_of_week = 1 (Monday)
df['day_of_week'] = df['visit_day'].apply(add_day_of_week) # add new column for day of week
df.loc[df['day_of_week'] == 0, 'day_of_week'] = 7 # set all the 0's to 7 (Sunday)
df['day_of_week'] = df['day_of_week'].astype(np.uint8) # convert to uint8 to save memory
df.loc[df['visitor_id'] == 4].head()
df['day_of_week'].value_counts()
It seems our hypothesis was correct: the weekend days (6 & 7) are the most popular days to visit the mall.
Next we create week_number
df['week_number'] = (np.floor((df['visit_day']-1) / 7) + 1).astype(np.uint32) # Through some trial and error I found this works
df.info()
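The formula can be sanity-checked against the boundaries described earlier (days 1-7 fall in week 1, day 8 starts week 2, day 1001 ends week 143). A scalar version, for illustration only:

```python
import numpy as np

def week_number(day):
    # Same formula as above, applied to a single day index
    return int(np.floor((day - 1) / 7) + 1)

print(week_number(1), week_number(7), week_number(8), week_number(1001))
```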
Export to feather to use later
df.to_feather(LOC + 'df')
To solve this classification problem we are going to:
df = pd.read_feather(LOC+'df')
dow_dummies = pd.get_dummies(df['day_of_week'], prefix='dow')
Now we join back dummies to original data
df = df.merge(dow_dummies, left_index = True, right_index = True)
df.head()
df.info()
%%time
df_w = df.groupby(['visitor_id', 'week_number']).agg({'dow_1':sum,
'dow_2':sum,
'dow_3':sum,
'dow_4':sum,
'dow_5':sum,
'dow_6':sum,
'dow_7':sum
})
df_w.head()
df_w = df_w.reset_index()
df_w.info()
Currently our dataset only contains weeks in which a customer made a visit. We need to add the weeks in which no visit happened. We can do this with a full outer join against the full range of weeks for every customer.
The next step builds that full range of (visitor_id, week_number) pairs:
visitor_ids = pd.Series(range(1, 300001))
visitor_ids_rep = visitor_ids.repeat(143).reset_index(drop=True)
full_range = pd.DataFrame({'visitor_id': visitor_ids_rep})
full_range['record'] = 1
full_range['week_number'] = full_range.groupby('visitor_id')['record'].cumsum()
full_range['visitor_id'] = full_range['visitor_id'].astype(np.uint64)
full_range['week_number'] = full_range['week_number'].astype(np.uint64)
full_range.info()
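On a tiny scale (2 invented visitors, 3 weeks each), the repeat/cumsum trick looks like this:

```python
import pandas as pd

toy = pd.DataFrame({'visitor_id': pd.Series([1, 2]).repeat(3).reset_index(drop=True)})
toy['record'] = 1
# the cumulative sum of a column of ones within each visitor yields week numbers 1..3
toy['week_number'] = toy.groupby('visitor_id')['record'].cumsum()
```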
Do the full outer join on the weekly data
%%time
df_w1 = df_w.merge(full_range[['visitor_id','week_number']],
left_on=['visitor_id','week_number'],
right_on=['visitor_id','week_number'],
how = 'outer')
df_w1 = df_w1.sort_values(['visitor_id', 'week_number'])
df_w1.info()
Notice that our DataFrame now contains 43 million records; previously it had 23.7 million. The new records represent non-visits. Their values are all NaN, so we need to impute them with 0.
df_w1 = df_w1.fillna(0)
Next we create a new column representing the total number of visits in a specific week.
dow_cols = df_w1.columns[df_w1.columns.str.contains('dow')]
df_w1['total_visits_in_week'] = df_w1[dow_cols].sum(axis=1)
df_w1.tail()
df_w1.info()
Downcast the values stored as float64 to uint8 to save memory
df_w1_float = df_w1.select_dtypes(include=['float']).columns
df_w1[df_w1_float] = df_w1[df_w1_float].apply(pd.to_numeric, downcast='unsigned')
df_w1.info()
Nice, the memory usage reduces from 3.5 GB to 1.3 GB
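The downcast behaviour can be checked on a toy series of small non-negative floats (values invented): whole-number floats are converted to the smallest unsigned integer type that holds them.

```python
import pandas as pd

s = pd.Series([0.0, 1.0, 7.0])                   # float64 holding small whole numbers
s_small = pd.to_numeric(s, downcast='unsigned')  # downcast losslessly to unsigned ints
```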
df_w1 = df_w1.reset_index(drop=True)
Let's store this as a feather file to save our progress
df_w1.to_feather(LOC + 'df_w')
And delete all the unused objects from memory.
del df
del df_w
del full_range
del visitor_ids
del visitor_ids_rep
We can create the frequency of visits at a specific point in time by taking the cumulative sum of total_visits_in_week for each visitor.
df_w1['freq'] = df_w1.groupby('visitor_id')['total_visits_in_week'].cumsum().astype(np.uint32)
We will remove all the weeks where the customer's freq = 0, i.e. the first visit hasn't happened yet.
df_w2 = df_w1.loc[~(df_w1['freq'] == 0)].reset_index(drop = True)
df_w2.shape
df_w2.info()
Delete df_w1 to save some memory :)
del df_w1
df_w2.tail(5)
We create a column which tells us whether a visit (on any day) happened.
df_w2['any_visit_ind'] = (df_w2['total_visits_in_week'] > 0).astype(np.uint8)
df_w2.head()
The rows which represent no visit are just 1 - any_visit_ind.
df_w2['dow_0'] = 1 - df_w2['any_visit_ind']
df_w2.info()
Your previous visit will probably affect your next visit, so let's compute weeks_since_prev_visit.
%%time
weeks_since_prev_visit = []
c = 0.0  # initialise the counter; the first row is always a visit week because pre-first-visit weeks were removed
for r in df_w2['any_visit_ind']:
    if r == 1.0:
        c = 1.0
    else:
        c += 1.0
    weeks_since_prev_visit.append(c)
weeks_since_prev_visit = pd.DataFrame({'weeks_since_prev_visit':weeks_since_prev_visit})
df_w2 = pd.concat([df_w2, weeks_since_prev_visit], axis=1)
df_w2['weeks_since_prev_visit'] = df_w2['weeks_since_prev_visit'].astype(np.uint32)
df_w2.head(5)
df_w2.info()
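The counter logic can be verified on a made-up indicator sequence (the helper below is for illustration only):

```python
def weeks_since(any_visit_inds):
    """Reset to 1 on a visit week; otherwise count weeks since the last visit."""
    out, c = [], 0.0
    for r in any_visit_inds:
        c = 1.0 if r == 1.0 else c + 1.0
        out.append(c)
    return out

print(weeks_since([1.0, 0.0, 0.0, 1.0, 0.0]))
```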
df_w2.to_feather(LOC + 'df_w2')
Get the total number of visits for a specific day_of_week. The reasoning for this feature: someone who has often visited on Monday is likely to visit on Monday again.
%%time
df_w2['tot_dow_0'] = df_w2.groupby('visitor_id')['dow_0'].cumsum().astype(np.uint32)
df_w2['tot_dow_1'] = df_w2.groupby('visitor_id')['dow_1'].cumsum().astype(np.uint32)
df_w2['tot_dow_2'] = df_w2.groupby('visitor_id')['dow_2'].cumsum().astype(np.uint32)
df_w2['tot_dow_3'] = df_w2.groupby('visitor_id')['dow_3'].cumsum().astype(np.uint32)
df_w2['tot_dow_4'] = df_w2.groupby('visitor_id')['dow_4'].cumsum().astype(np.uint32)
df_w2['tot_dow_5'] = df_w2.groupby('visitor_id')['dow_5'].cumsum().astype(np.uint32)
df_w2['tot_dow_6'] = df_w2.groupby('visitor_id')['dow_6'].cumsum().astype(np.uint32)
df_w2['tot_dow_7'] = df_w2.groupby('visitor_id')['dow_7'].cumsum().astype(np.uint32)
display_all(df_w2.head())
df_w2.info()
We can get each day_of_week's proportion of the total visits by dividing by freq. Reasoning: if someone has visited 10 times and 9 out of 10 visits were on a Sunday, they are likely to visit on a Sunday again. Non-visits are calculated slightly differently because we want the proportion of non-visit weeks out of the full time range.
df_w2['prop_dow_0'] = pd.to_numeric(df_w2['tot_dow_0'] / (df_w2['freq'] + df_w2['tot_dow_0']), downcast = 'float')
df_w2['prop_dow_1'] = pd.to_numeric(df_w2['tot_dow_1'] / df_w2['freq'], downcast = 'float')
df_w2['prop_dow_2'] = pd.to_numeric(df_w2['tot_dow_2'] / df_w2['freq'], downcast = 'float')
df_w2['prop_dow_3'] = pd.to_numeric(df_w2['tot_dow_3'] / df_w2['freq'], downcast = 'float')
df_w2['prop_dow_4'] = pd.to_numeric(df_w2['tot_dow_4'] / df_w2['freq'], downcast = 'float')
df_w2['prop_dow_5'] = pd.to_numeric(df_w2['tot_dow_5'] / df_w2['freq'], downcast = 'float')
df_w2['prop_dow_6'] = pd.to_numeric(df_w2['tot_dow_6'] / df_w2['freq'], downcast = 'float')
df_w2['prop_dow_7'] = pd.to_numeric(df_w2['tot_dow_7'] / df_w2['freq'], downcast = 'float')
df_w2.info()
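A worked example with invented numbers: a visitor with freq = 10 visit days, 9 of them on Sunday, and 2 non-visit weeks, following the same formulas as above:

```python
freq, tot_dow_7, tot_dow_0 = 10, 9, 2        # hypothetical counts for one visitor
prop_dow_7 = tot_dow_7 / freq                # 90% of visits were on Sunday
prop_dow_0 = tot_dow_0 / (freq + tot_dow_0)  # non-visit share over the full range
```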
Finally, we need to shift all the features to the next time period: when predicting, we use the previous week's information to predict this week.
df_w2.to_feather(LOC + 'df_w2')
df_w2 = pd.read_feather(LOC + 'df_w2')
Get a list of features to shift
features_tot_dow = list(df_w2.columns[df_w2.columns.str.contains('tot_dow')].values)
features_prop_dow = list(df_w2.columns[df_w2.columns.str.contains('prop_dow')].values)
features_other = ['freq', 'weeks_since_prev_visit']
features = []
features.extend(features_tot_dow)
features.extend(features_prop_dow)
features.extend(features_other)
features
df_w2[features] = df_w2.groupby('visitor_id')[features].shift(1)
df_w2.info()
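On a toy frame (2 invented visitors), groupby().shift(1) behaves as follows: each visitor's first week becomes NaN and every other week receives the previous week's value.

```python
import pandas as pd

toy = pd.DataFrame({'visitor_id': [1, 1, 1, 2, 2],
                    'freq': [1, 2, 3, 1, 2]})
# shift within each visitor so that week w carries week w-1's history
toy['freq_prev'] = toy.groupby('visitor_id')['freq'].shift(1)
```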
Again we need to downcast all the float64 values
df_w2_float = df_w2.select_dtypes(include=['float']).columns
df_w2[df_w2_float] = df_w2[df_w2_float].apply(pd.to_numeric, downcast='float')
display_all(df_w2.head())
We have 42 million records, so I will just remove the 300000 rows with missing values. These are the first visits for every customer.
df_w2 = df_w2.loc[~(df_w2['freq'].isnull())].reset_index(drop = True)
display_all(df_w2.head())
df_w2.to_feather(LOC + 'df_w3')
#del df_w
#del df_w1
del df_w2
Finally, we can start the fun part. I decided to use a random forest for each binary classification. Why only RF? RF is a fairly robust algorithm that has proven to be a top performer on many modelling problems. I also do not have time to try many different models.
df = pd.read_feather(LOC + 'df_w3')
df.shape
For the training set we will use weeks 130-140, the weeks closest to the test and validation sets. The validation, test and final test sets will be the last 3 weeks in our data. I didn't need to make test_final, but decided it would be good to have another test set to make sure the model generalizes well.
test_final = df.loc[df['week_number'] == 143]
test = df.loc[df['week_number'] == 142]
valid = df.loc[df['week_number'] == 141]
train = df.loc[df['week_number'].isin(range(130, 141))]
del df
print(test.shape)
print(test_final.shape)
print(valid.shape)
print(train.shape)
features_tot_dow = list(train.columns[train.columns.str.contains('tot_dow')].values)  # use train since df was deleted above
features_prop_dow = list(train.columns[train.columns.str.contains('prop_dow')].values)
features_other = ['freq', 'weeks_since_prev_visit']
features = []
features.extend(features_tot_dow)
features.extend(features_prop_dow)
features.extend(features_other)
I removed the features below after investigating the plots in the next section.
features.remove('tot_dow_0') # correlated with prop_dow_0
features.remove('freq') # correlated with prop_dow_0
targets = ['dow_0', 'dow_1', 'dow_2', 'dow_3', 'dow_4', 'dow_5', 'dow_6', 'dow_7']
X_train_full = train[features]
y_train_full = train[targets]
X_valid = valid[features]
y_valid = valid[targets]
X_test = test[features]
y_test = test[targets]
X_test_final = test_final[features]
y_test_final = test_final[targets]
display_all(X_train_full.tail())
y_train_full.tail()
from sklearn.preprocessing import StandardScaler
sample_index = X_train_full.sample(300000).index
#X_train = X_train_full.loc[sample_index]
#y_train = y_train_full.loc[sample_index]
X_train = X_train_full
y_train = y_train_full
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X_train.columns)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)
X_test_final = scaler.transform(X_test_final)
We removed tot_dow_0 and freq because they are highly correlated with prop_dow_0. RFs are actually quite good at handling multicollinearity because of the max_features argument, so we probably didn't need to do this. However, if other models are tried in the future, it is better not to have these features together.
import scipy
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy as hc
corr = np.round(scipy.stats.spearmanr(X_train).correlation, 4)
corr_condensed = hc.distance.squareform(1-corr)
z = hc.linkage(corr_condensed, method='average')
fig = plt.figure(figsize=(16,10))
dendrogram = hc.dendrogram(z, labels=X_train.columns, orientation='left', leaf_font_size=16)
plt.show()
import seaborn as sns
correlation = X_train.corr()
plt.figure(figsize=(10, 10))
sns.heatmap(correlation, vmax=1, square=True, annot=True, cmap='cubehelix')
%matplotlib inline
# take a 1% sample as this is computationally expensive
df_sample = X_train.sample(frac=0.01)
# Pairwise plots
sns.pairplot(df_sample)
We will be looking at scores constantly to compare different parameter settings, so it is easier to wrap this in a function.
import math
def rmse(x,y): return math.sqrt(((x-y)**2).mean())
from sklearn import metrics
def print_score(m, dow):
    res = [rmse(m.predict(X_train), y_train[dow]),
           rmse(m.predict(X_valid), y_valid[dow]),
           m.score(X_train, y_train[dow]),
           m.score(X_valid, y_valid[dow])]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import forest
print(X_train.shape)
print(y_train.shape)
train.week_number.unique()
Setting the subsampling parameter for RFs helps reduce overfitting and decreases training time.
def set_rf_samples(n):
    """ Changes scikit-learn's random forests to give each tree a random sample of
    n random rows.
    """
    forest._generate_sample_indices = (lambda rs, n_samples:
        forest.check_random_state(rs).randint(0, n_samples, n))
I tried a couple of settings for each random forest parameter and found the settings below performed best. We could probably spend more time tuning to get slightly better results.
set_rf_samples(300000)
%%time
target = 'dow_1'
rf_classifier = forest.RandomForestClassifier(n_estimators = 80,
max_features=0.1,
min_samples_leaf=3,
n_jobs = -1,
oob_score=True,
class_weight='balanced'
)
rf_classifier.fit(X_train, y_train[target])
print_score(rf_classifier, target)
It takes around 3 minutes per model, so doing all 8 will take around 25 minutes.
probs = pd.DataFrame(rf_classifier.predict_proba(X_test))
rf_roc_auc = metrics.roc_auc_score(y_test[target], probs[1])  # score this target's labels against the positive-class probability
rf_roc_auc
rf_classifier.predict_proba(X_test)[0:5]
rf_pred = rf_classifier.predict(X_test)
rf_pred
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
print(accuracy_score(y_test[target], rf_pred))
cm = confusion_matrix(y_test[target], rf_pred)
print(cm)
def get_feature_importance(m):
    feats = {}  # a dict to hold feature_name: feature_importance
    for feature, importance in zip(features, m.feature_importances_):
        feats[feature] = importance  # add the name/value pair
    importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Importance'})
    importances = importances.sort_values(by='Importance')
    return importances
get_feature_importance(rf_classifier).plot(kind='barh', figsize=(12,7))
rf_models = {}
for target in targets:
    rf_classifier = forest.RandomForestClassifier(n_estimators=80,
                                                  max_features=0.1,
                                                  min_samples_leaf=3,
                                                  n_jobs=-1,
                                                  oob_score=True,
                                                  class_weight='balanced')
    rf_classifier.fit(X_train, y_train[target])
    print(target)
    print_score(rf_classifier, target)
    rf_models[target] = rf_classifier
rf_models
for dow, m in rf_models.items():
    get_feature_importance(m).plot(kind='barh', figsize=(12,5), title=dow)
train_probs = pd.DataFrame(columns=targets)
valid_probs = pd.DataFrame(columns=targets)
test_probs = pd.DataFrame(columns=targets)
test_final_probs = pd.DataFrame(columns=targets)
for target in targets:
    # Probability of the positive class, i.e. a visit on that day
    train_probs[target] = rf_models[target].predict_proba(X_train)[:,1]
    valid_probs[target] = rf_models[target].predict_proba(X_valid)[:,1]
    test_probs[target] = rf_models[target].predict_proba(X_test)[:,1]
    test_final_probs[target] = rf_models[target].predict_proba(X_test_final)[:,1]
train_viz = pd.concat([train.reset_index(drop=True), train_probs], axis=1)
valid_viz = pd.concat([valid.reset_index(drop=True), valid_probs], axis=1)
test_viz = pd.concat([test.reset_index(drop=True), test_probs], axis=1)
test_final_viz = pd.concat([test_final.reset_index(drop=True), test_final_probs], axis=1)
train_viz['set'] = 'train'
valid_viz['set'] = 'valid'
test_viz['set'] = 'test'
test_final_viz['set'] = 'test_final'
all_viz = pd.concat([train_viz, valid_viz, test_viz, test_final_viz], axis=0)
all_viz.shape
all_viz.to_csv(LOC + 'all_viz.csv', index=False)
test[features].tail()
test[targets].tail()
test_probs.tail()
predicted_day = test_probs.idxmax(axis=1)
predicted_day.head()
y_test.reset_index(drop=True).tail()
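The idxmax result holds column names like 'dow_3'. With made-up probabilities, turning those into integer day predictions (0 meaning no visit) could look like this:

```python
import pandas as pd

toy_probs = pd.DataFrame({'dow_0': [0.1, 0.7],   # hypothetical per-day probabilities
                          'dow_3': [0.8, 0.2],
                          'dow_7': [0.1, 0.1]})
predicted = (toy_probs.idxmax(axis=1)
                      .str.replace('dow_', '', regex=False)
                      .astype(int))
```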